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Abstract Statistical causal inference from observational studies often requires ad¬ 
justment for a possibly multi-dimensional variable, where dimension reduction is 
crucial. The propensity score, first introduced by Rosenbaum and Rubin, is a popular 
approach to such reduction. We address causal inference within Dawid’s decision- 
theoretic framework, where it is essential to pay attention to sufficient covariates 
and their properties. We examine the role of a propensity variable in a normal linear 
model. We investigate both population-based and sample-based linear regressions, 
with adjustments for a multivariate covariate and for a propensity variable. In ad¬ 
dition, we study the augmented inverse probability weighted estimator, involving 
a combination of a response model and a propensity model. In a linear regression 
with homoscedasticity, a propensity variable is proved to provide the same estimated 
causal effect as multivariate adjustment. An estimated propensity variable may, but 
need not, yield better precision than the true propensity variable. The augmented 
inverse probability weighted estimator is doubly robust and can improve precision 
if the propensity model is correctly specified. 
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1 Introduction 

Causal effects can be identified from well-designed experiments, such as ran¬ 
domised controlled trials (RCT), because treatment assignment is entirely unrelated 
to subjects’ characteristics, both observed and unobserved. Suppose there are two 
treatment arms in an RCT: treatment group and control group. Then the average 
causal effect (ACE) can simply be estimated as the outcome difference of the two 
groups from the observed data. However, randomised experiments, although ideal 
and to be conducted whenever possible, are not always feasible. For instance, to in¬ 
vestigate whether smoking causes lung cancer, we cannot randomly force a group of 
subjects to take cigarettes. Moreover, it may take years or longer for development of 
this disease. Instead, a retrospective case-control study may have to be considered. 
The task of drawing causal conclusion, however, becomes problematic since sim¬ 
ilarity of subjects from the two groups will rarely hold, e.g., lifestyles of smokers 
might be different from those of non-smokers. Thus, we are unable to “compare like 
with like” - the classic problem of confounding in observational studies, which may 
require adjusting for a suitable set of variables (such as age, sex, health status, diet). 
Otherwise, the relationship between treatment and response will be distorted, and 
lead to biased inferences. In general, linear regressions, matching or subclassifica¬ 
tion are used for adjustment purpose. If there are multiple confounders, especially 
for matching and subclassification, identifying two individuals with very similar val¬ 
ues of all confounders simultaneously would be cumbersome or impossible. Thus, it 
would be sensible to replace all the confounders by a scalar variable. The propensity 
score i22\ is a popular dimension reduction approach in a variety of research fields. 


2 Framework 

The aim of statistical causal inference is to understand and estimate a “causal ef¬ 
fect”, and to identify scientific and in principle testable conditions under which the 
causal effect can be identified from observational studies. The philosophical nature 
of “causality” is reflected in the diversity of its statistical formalisations, as exem¬ 
plified by three frameworks: 

1. Rubin’s potential response framework Il24ll^l26l (also known as Rubin’s causal 
model) based on counterfactual theory; 

2. Pearl’s causal framework muni richly developed from graphical models; 

3. Dawid’s decision-theoretic framework (61171 based on decision theory and prob¬ 
abilistic conditional independence. 

In Dawid’s framework, causal relations are modelled entirely by conditional proba¬ 
bility distributions. We adopt it throughout this chapter to address causal inference; 
the assumptions required are, at least in principle, testable. 

Let X, T and Y denote, respectively, a (typically multivariate) confounder, treat¬ 
ment, and response (or outcome). For simplicity, T is a scalar and X a multi- 
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dimensional variable. We assume that T is binary: 1 (treatment arm) and 0 (control 
arm). Within Dawid’s framework, a non-stochastic regime indicator variable Fj, 
taking values 0, 0 and 1, is introduced to denote the treatment assignment mecha¬ 
nism operating. This divides the world into three distinct regimes, as follows: 

1. Ft = %'■ the observational (idle) regime. In this regime, the value of the treatment 
is passively observed and treatment assignment is determined by Nature. 

2. Ft = I', the interventional treatment regime, i.e., treatment T is set to 1 by ma¬ 
nipulation. 

3. Ft = 0: the interventional control regime, i.e., treatment T is set to 0 by manip¬ 
ulation. 

For example, in an observational study of custodial sanctions, our interest is in the 
effect of custodial sanction, as compared to probation (noncustodial sanction), on 
the probability of re-offence. Then TV = 0 denotes the actual observational regime 
under which data were collected; TV = 1 is the (hypothetical) interventional regime 
that always imposes imprisonment; and TV = 0 is the (hypothetical) interventional 
regime that always imposes probation. Throughout, we assume full compliance and 
no dropouts, i.e., each individual actually takes whichever treatment they are as¬ 
signed to. Then we have a joint distribution Pf of all relevant variables in each 
regimeTV =/(/ = 0,1,0). 

In the decision-theoretic framework, causal assumptions are construed as asser¬ 
tions that certain marginal or conditional distributions are common to all regimes. 
Such assumptions can be formally expressed as properties of conditional indepen¬ 
dence, where this is extended to allow non-stochastic variables such as TV umaci. 
For example, the “ignorable treatment assignment” assumption in Rubin’s causal 
model (RCM) ll22l can be expressed as 

Y1LFt\T, ( 1 ) 

read as “T is independent of TV given T”. Flowever, this condition will be most 
likely inappropriate in observational studies where randomisation is absent. 

Causal effect is defined as the response difference by manipulating treatment, 
which purely involves interventional regimes. In particular, the population-based 
average causal effect (ACE) of the treatment is defined as: 

ACE:=E(y|TV = l)-E(y|TV =0), (2) 

or alternatively, 

ACE:=Ei(y)-Eo(yB (3) 

Without further assumptions, by its definition ACE is not identifiable from the 

observational regime. 


* For convenience, the values of the regime indicator Fj are presented as subscripts. 
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3 Identification of ACE 

Suppose the joint distribution of {Fj, T, F ) is known and satisfies (dJ. Is ACE iden¬ 
tifiable from data collected in the observational regime? Note that ([T]) demonstrates 
that the distribution of Y given T = t is the same, whether t is observed in the ob¬ 
servational regime Ft = 0, or in the interventional regime Ft = t. As discussed, this 
assumption would not be satisfied in observational studies, and thus, direct compar¬ 
ison of response from the two treatment groups cannot be interpreted as the causal 
effect from observational data. 

Definition 1. The “face-value average causal effect” (FACE) is defined as: 

FACE:=E0(F|7’ = 1)-E0(F|7’ = O). (4) 

It would be hardly true that FACE = ACE, as we would not expect the conditional 
distribution of Y given T = t is the same in any regime. In fact, identification of 
ACE from observational studies requires, on one hand, adjusting for confounders, 
on the other hand, interplay of distributional information between different regimes. 
One can make no further progress unless some properties are satisfied. 


3.1 Strongly sufficient covariate 

Rigorous conditions must be investigated so as to identify ACE. 

Definition 2. X is a covariate if: 

Property L 

XIFFt. 

That is, the distribution of X is the same in any regime, be it observational or 
interventional. In most cases, X are attributes determined prior to the treatment, for 
example, blood types and genes. 

Definition 3. X is a sufficient covariate for the effect of treatment T on response Y 
if, in addition to Property [1] we have 

Property 2. 

Y1FFt\{XJ). 

Property |2] requires that the distribution of Y, given X and T, is the same in 
all regimes. It can also be described as “strongly ignorable treatment assignment, 
given X" fTlX . We assume that readers are familiar with the concept and properties 
of directed acyclic graphs (DAGs). Then Properties [1] and |2 can be represented by 
means of a DAG as FiglT] The dashed arrow fromX to T indicates that T is partially 
dependent on X, i.e., the distribution of T depends on X in the observational regime, 
but not in the interventional regime where Ft =t. 
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Definition 4 . X is a strongly sufficient covariate if, in addition to Properties [T] and 
|2] we have 

Property 5. P0(7’ = f | X) > 0 with probabilility 1, for f = 0,1. 

Property [3 requires that, for any X = x, both treatment and control groups are 
observed in the observational regime. 

Lemma 1. Suppose X is a strongly sufficient covariate. Then, considered as a joint 
distributions for (Y,X,T), P; is absolutely continuous with respect to P© (denoted 
byPt = 0andt = 1. That is, for every event A determined by (X,T,Y), 

P0(A)=O ^ Pr(A)=0. (5) 

Equivalently, if an event A occurs with probability 1 under the measure P®, then it 
occurs with probability 1 under the measure Pf (t = 0,1). 

Proof. Property m expressed equivalently as [Y,X,T)\EFt\{X,T), asserts that 
there exists a function w{X,T) such that 

Pf{A\X,T) = w{X,T) 

almost surely (a.s.) in each regime / = 0,1,0. Let P0(A) = 0. Then a.s. [P©], 

0 = P0(A I X) = w{X,1)Pq{T = 1 I X) + w{X,Q)Pq{T = 0 | X). 

By Property [3] for f = 0,1, 

w{X,t) = 0 (6) 

a.s. [P0]. As w{X,t) is a function of X, it follows that (|6]l holds a.s. [P;] by Property 
[T] Consequently, 

w{X,T)=0 a.s. [P,], (7) 

since a.s. [P;], T = t and w{X,T) = w{X,t) for any bounded function w. Then by 

0 , 

P,(A) = E,{P,(A I X, T)} = E,{w{X,T)} = 0. 

Lemma 2. For any integrable Z f:^{Y,X,T), and any versions of the conditional 
expectations, 

E,{Z\X)=EtiZ\X,T) [P,]. 

^ The ^ symbol is interpreted as “a function of’. 


(8) 
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Proof. Let y (X, L) be an arbitrary but fixed version of E, (Z | X, L). Then j{X,T) = 
j{X,t) a.s. [P;], and j{X,t) serves as a version of E, (Z \X,T) under [P,]. So 

E,{Z\X)=Er{j{XJ)\X]=E,{j{X,t)\X] = j{X,t) a.s. [P,]. 

Thus j{Xf ) is a version of E,(Z | X) under [P;] and ® follows. 

Since E,(Z | Z) is a function of X, then by Property [T] j{X,t) is a version of 
E,(Z I Z) in any regime. Let g{X,T) be some arbitrary but fixed version of E 0 (Z | 
X,T). 

Theorem 1. Suppose that X is a strongly sufficient covariate. Then for any inte- 
grable Z ^ {Y,X,T), and with notation as above, 

j{Xf)=g{X,t) (9) 


almost surely in any regime. 


Proof. By Property |2] there exists a function h{X,T) which is a common version of 
E/(Z I X,T) under [P/^] for / = 0,1,0. Then h{X,T) serves as a version of E 0 (Z | 
X,T) under [Pg], and a version of E,(Z | Z, T) under [P(]. As j{X,T) is a version of 

E,(z|z,r), 

j{X,T)=h{XJ) a.s. [P,], 


and consequently 


j{X,t)=h{Xf) a.s. [P,]. 


Since j{X,t) and liiXf) are functions ofZ, by Property[T] 


j{Xf)=h{Xf) a.s. [P/] (10) 

for / = 0,1,0. We also have that g{X,T) = h{X,T) a.s. [Pg], and so, by Lemma[T] 
a.s. [P;]. Then g{X,t) = h{Xf) a.s. [P,j, where g{Xf) and h{X,t) are both func¬ 
tions of Z. By Property [H 


g{Xf)=h{Xf) a.s. [Pf] (11) 

for / = 0,1,0. Thus (|9|l holds by (fTOl i and (fTTT i. 


3.2 Specific causal effect 

Let Z be a covariate. 

Definition 5. The specific causal effect of T on Y, relative to X, is 


SCE;=Ei(T |Z)-Eo(T |Z). 
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We annotate SCE^ to express SCE as a function of X and write SCE(x) to indicate 
that X takes specific value x. Because it is defined in the interventional regimes, 
SCE has a direct causal interpretation, i.e., SCE(;ic) is the average causal effect in 
the subpopulation with X =x. 

Although we do not assume the existence of potential responses, when this 
assumption is made we might proceed as follows. Take X to be the pair Y = 
(y(l),y(0)) of potential responses—which is assumed to satisfy Property [T| Then 
Et{Y I Y) =Y{t), and consequently 

SCEy = T(1)-T(0), 

which is the definition of “individual causal effect”, ICE, in Rubin’s causal model. 
Thus, although the formalisations of causality are different, SCE in Dawid’s deci¬ 
sion theoretic framework can be reagarded as a generalisation of ICE in Rubin’s 
causal model. 

We can easily prove that, for any covariate X, ACE = E(SCEy), where the ex¬ 
pectation may be taken in any regime. Since by Property [T] 

E0{E,(T|Y)}=E,{E,(T|Y)} = E,(T), 

for f = 0,1. Thus by subtraction, ACE = Ey(SCEx) for any regime / = 0,1,0 and 
therefore the subscript / can be dropped. Hence, ACE is identifiable from observa¬ 
tional data so long as SCEx is identifiable from observational data. If Y is a strongly 
sufficient covariate, by Theorem [T] E,(T | Y) is identifiable from the observational 
regime. It follows that SCE can be estimated from data purely collected in the ob¬ 
servational regime. Then ACE expressed as 

ACE = E0(SCEx) (12) 

is identifiable, from the observational joint distribution of {X,T,Y). Eormula (fT^ is 
Pearl’s “back-door formula” El because by the property of modularity, P(Y) is the 
same with or without intervention on T and thus can be taken as the distribution of 
X in the observational regime. 


3.3 Dimension reduction of strongly sufficient co variate 

Suppose Y is a multi-dimensional strongly sufficient covariate. The adjustment pro¬ 
cess might be simplified if we could replace X by some reduced variable PAY, 
with fewer dimensions—so long as V is itself a strongly sufficient covariate. Now 
since P is a function of X, Properties[T]and|3]will automatically hold for P. We thus 
only need to ensure that P satisfies Property|2] that is. 


Y1XFt\{VJ). 


( 13 ) 
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Since two arrows initiate from X in FiglT] possible reductions may be naturally 
considered, on the pathways from X to T, and from X to Y. Indeed, the following 
theorem gives two alternative sufficient conditions for (fOl l to hold. However, ( fTSl l 
can still hold without these conditions. 

Theorem 2. Suppose X is a strongly sufficient covariate and V ^X. Then V is a 
strongly sufficient covariate if either of the following conditions is satisfied: 

(a). Response-sufficient reduction: 


YlLX\{V,FT = t), 

(14) 

YllJ(\iV,TffiT = ®), 

(15) 


for f = 0,1. It is indicated in ( 1741 ) that, in each interventional regime, X con¬ 
tributes nothing towards predicting Y once we know V. In other words, as long 
as V is observed, X need not be observed to make inference on Y. While (US 
implies that in the observational regime, knowing X is of no value of predicting 
Y ifV and T are known. 

(b). Treatment-sufficient reduction: 

T11JC\{V,Ft=%). (16) 

That is, in the obser\’ational regime, treatment does not depend on X conditioning 
on the information ofV. 

Proofs of the above reductions were provided in ||9l- An alternative proof of |(b)| 
can be implemented graphically 0, which results in a DAG as Fig. I2PI off which 
(flbl l and (fl^ can be directly read. 

Fig. 2 Treatment sufficient 
reduction 



A graphical approach to |(a)| does not work since Property[3]is required. However, 
while not serving as a proof, Fig. [^conveniently embodies the conditional indepen¬ 
dencies Properties[T]|2and the trivial property V 1LT\{X ,Ft), as well as (lOT l. 


^ The hollow arrow head, pointing from X to V, is used to emphasise that F is a function of X. 
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Fig. 3 Response sufficient 
reduction 



4 Propensity analysis 


Here we further discuss the treatment-sufficient reduction, which does not involve 
the response. This brings in the concept of propensity variable: a minimal treatment- 
sufficient covariate, for which we investigate the unbiasedness and precision of the 
estimator of ACE. Also the asymptotic precision of the estimated ACE, as well as 
the variation of the estimate from the actual data, will be analysed. In a simple nor¬ 
mal linear model that applied for covariate adjustment, two cases are considered; 
homoscedasticity and heteroscedasticity. A non-parametric approach - subclassi¬ 
fication will also be conducted, for different covariance matrices of X of the two 
treatment arms. The estimated ACE obtained by adjusting for multivariate X and by 
adjusting for a scalar propensity variable, will then be compared theoretically and 
through simulations 121. 


4.1 Propensity score and propensity variable 

The propensity score (PS), first introduced by Rosenbaum and Rubin, is a balancing 
score l2^ . Regarded as a useful tool to reduce bias and increase precision, it is a 
very popular approach to causal effect estimation. PS matching (or subclassification) 
method, widely used in various research fields, exploits the property of conditional 
(within-stratum) exchangeability, whereby individuals with the same value of PS 
(or belonging to a group with similar values of PS) are taken as comparable or 
exchangeable. We will, however, mainly focus on the application of PS within a 
linear regression. The definitions of the balancing score and PS given below are 
borrowed from ll22l . 

Definition 6. A balancing score b{X) is a function of X such that, in the observa¬ 
tional regime 0, the conditional distribution of X given b{X) is the same for both 
treatment groups. That is. 


Xll.T\{b{X),FT =%). 

^ Rosenbaum and Rubin do not define the balancing score and the PS explicitly for observational 
studies, although they do aim to apply the PS approach in such studies. 
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It has been shown that adjusting for a balancing score rather than X results in unbi¬ 
ased estimate of ACE, with the assumption of strongly ignorable treatment assign¬ 
ment II 22 II . One can trivially choose b{X) = X, but it is more constructive to find a 
balancing score to be a many to one function. 

Definition 7. The propensity score, denoted by FI, is the probability of being as¬ 
signed to the treatment group given X in the observational regime: 

fT:-P0(r = l |X). 

We shall use the symbol n to denote a particular realisation of FI. By ( fT6b and 
Definitions |6] and I 2 ] we assert that PS is the coarsest balancing score. For a subject 
i, PS is assumed to be positive, i.e., 0 < tt, < 1. Those with the same value of PS are 
equally likely to be allocated to the treatment group (or equivalently, to the control 
group), which provides observational studies with the randomised-experiment-like 
property based on measured X. This is because the characteristics of the two groups 
with the same or similar PS are “balanced”. Therefore, the scalar PS serves as a 
proxy of multi-dimensional variable X, and thus, it is sufficient to adjust for the for¬ 
mer instead of the latter. In observational studies, PS is generally unknown because 
we do not know exactly which components of X have impact on T and how the treat¬ 
ment is associated with them. However, we can estimate PS from the observational 
data. 

PS analysis for causal inference is based on a sequence of two stages: 

Stage 1: PS Estimation. It is estimated by the observed T and X, and normally 
by a logistic regression of T on X for binary treatment. Note that the response Y is 
irrelevant at this stage. Because we can estimate PS without observing Y, there is no 
harm in finding an ’’optimal” regression model of T on X by repeated trials. 

Stage 2: Adjusting for PS. Various adjustment approaches have been developed, 
e.g., linear regression. If we are unclear about the conditional distribution of Y given 
T and PS, non-parametric adjustment such as matching or subclassification could 
be applied instead. 

Although two alternatives for dimension reductions have been provided, in prac¬ 
tice, this type of reduction may be more convenient in many cases. For example, 
certain values of the response may occur rarely and only after long observation 
periods after treatment. In addition, it may sometimes be tricky to determine a ’’cor¬ 
rect” form for a regression model of T on X, T and Fj. Swapping the positions of X 
and T, Equation (fTbl l can be re-expressed as 

X1LT\{V,Ft = (!)), (17) 

which states that the observational distribution of X given V is the same for both 
treatment arms. That is to say, V is a balancing score forX. 

The treatment-sufficient condition |(b)| can be equivalently interpreted as follows. 
Consider the family ^ = {Qo,Q\} consisting of observational distributions of X 
for the two groups T = 0 and T = 1. Then Equation (fTbl l. re-expressed as (fTTl i. says 
that y is a sufficient statistic (in the usual Fisherian sense [S]) for this family. In par- 
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ticular, a minimal treatment-sufficient reduction is obtained as a minimal sufficient 
statistic for i.e., any variable almost surely equal to a one-one function of the 
likelihood ratio statistic A := qi{X)/qo{X), where qi{-) is a version of the density 
of a . 

Definition 8. A propensity variable is a minimal treatment-sufficient covariate, or a 
one-one function of the likelihood ratio statistic A. 

The concept of a propensity variable is derived from PS which is related to A in 
the following way: 


fT = P0(7’= 1 |X) = 0A/(l-0 + 0A), (18) 

where 0 < 0 := P0(7’ = 1) < 1 by Property|3] 

It is entirely possible, from the above discussion, that a different propensity vari¬ 
able will be obtained if we start from a different strongly sufficient covariate. 


4.2 Normal linear model (homoscedasticity) 

The above theory will be illustrated by a simple example under linear-normal ho- 
moscedastic parametric assumptions. 


4.2.1 Model construction 

Suppose we have a scalar response variable Y, and a (p x 1) strongly sufficient 
covariate X that satisfies Properties [T]|2]and|3] Let the conditional distribution of T 
given {X,T,Ft) be specified as: 

Y\{X,T,FT)^.yF{d + 5T + b'X,(l>), (19) 

where the symbol ~ stands for“is distributed as” and the symbol ^ stands for 
normal distribution, with parameters d and 5 (scalar), b{px 1), and (j) (scalar). Note 
that here and in the following models, we assume no interactions between variables 
in X although interactions can be formally dealt with via dummy variables. Suppose 
X is a strongly sufficient covariate, then the coefficient 5 of T in (fT9]) is the average 
causal effect ACE, which can be easily proved as follows. 

ACE = E(SCEx) = E{Ei (T | X)} -E{Eo(T | X)} 

= E{d + 5 + b'X)-E{d + b'X) = 5 by ([T9|). 

It is readily seen that the specific causal effect SCEx is a constant and equals 5. 

Erom ( fT9] l. the linear predictor LP := b'X satisfies the conditional independence 
properties in Condi tion|(a)|of Theorem[2] Thus, LP is a response-sufficient reduction 
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of X, and E(Y \ LP, T) = d + dT + LP, with coefficient 5 of T that does not depend 
on the regime by virtue of the sufficiency condition. 

Now assume that our model for the observational distribution of {T,X) is as 
follows: 

P0(7’ = l) = 0 (20) 

X\iT,FT = <d)r^^{^T,^) ( 21 ) 

with parameters 9 G (0,1), jiaip x 1), {p x 1), and covariance matrix E (p x p, 
positive definite, identical in the two treatment groups). The corresponding marginal 
distribution of is a multivariate normal mixture 

X\FT=(d^il-e)^{po,^) + 9.J^iPuE), (22) 

in the observational regime, and because we have assumed Property [T]to hold, also 
in the interventional regime. The observational distribution of T given X is given by 
(fTSl) . with 


logA = log{P 0 (X I T = l)}-log{P 0 (X I r =0)} 

= -i(M[2:-Vi-Mo2:^Vo)+LD, (23) 

where 

LD := /X, (24) 

with 

7(Ml-Mo). (25) 

LD is Fisher’s linear discriminant Ha, best separating the pair of multivariate nor¬ 
mal observational distributions for X \ T = 0 and X \ T = 1. 

Suppose y is a linear sufficient covariate - a linear function of X that is itself a 
sufficient covariate. We have proved that the coefficient of T in the observational 
linear regression of T on T and V is 513 . From (|2^ we see that LD is a propensity 
variable which is a linear strongly sufficient covariate. We deduce that under the 
given distributions, the coefficient of T in the observational regression of T on T 
and LD is 5. 

Theorem 3. The coefficient of T in the linear regression ofY on (TjLD) is the same 
as that in the linear regression ofY on {T,X). 

Theorem [3 states that it is algebraically true that X and Fisher’s linear discriminant 
LD generate identical coefficient of T in linear regressions, which does not have a 
direct link to the regimes and causality whatsoever. In our linear normal model, 5 is 
interpreted as ACE and can be identified from the observational data simply because 
we have assumed thatX is a strongly sufficient covariate. Applying Theorem|3]to the 
empirical distribution of (T, T,X) from a sample, we deduce Corollary [1] as follows. 

Corollary 1. Suppose we have data on (Y,T,X) for a sample of individuals. Let 
LD* be the sample linear discriminant for T based on X. Then the coefficient of T 
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in the sample linear regression ofY on T and LD* is the same as that in the sample 
linear regression ofY on T andX. 

Rosenbaum and Rubin fT2\ (§ 3.4) also give this result with a brief non-causal ar¬ 
gument: whenever the sample dispersion matrix is used in both the form of LD and 
regression adjustment, the estimated coefficient of T must be the same. 

As discussed El, here is a paradox: we regard adjustment for the propensity 
variable as an adjustment for the treatment assignment process, by regressing F on L 
and the estimated propensity variable LD*. However, from the result of Corollary[Tl 
it appears that what we actually adjust for is the full set of covariates X, which makes 
the treatment assignment process completely irrelevant. 


4.2.2 Precision in propensity analysis 

One might intuitively think that the precision of the estimated ACE would be im¬ 
proved if we were to adjust for a scalar variable — the sample-based propensity vari¬ 
able LD*, rather than p-dimensional variable X. However, Corollary [T] tells us that 
adjusting for LD* does not increase the precision of our estimator. In fact, whether 
one adjusts for LD* and for all the p predictors makes absolutely no difference to 
our estimate, and thus, to its precision. Similar conclusions have been drawn in 
|[IOl[33l[3l]. Our intuition is that the increased precision obtained by regressing on 
V is offset by the overhtting error involved in selecting V. 

Previous evidence 11211 flTl l28ll supports the claim that the estimated propensity 
variable outperforms the true propensity variable. That is, adjusting for the former 
yields higher precision of the estimated ACE than the latter. These two types of 
adjustment correspond to regressing Y on (TjLD) and on (TjLD*) in our model 
and both provide an unbiased estimator of ACE. The claim obviously cannot be al¬ 
ways valid by simply considering a special case: LD = LP, because by Corollary[Tl 
regressing on LD* is the same as adjusting for LP*, which by the Gauss-Markov the¬ 
orem will be less precise than regressing on the true linear predictor LP (or equiva¬ 
lently LD). Nevertheless, the claim is likely to hold when LD is not highly correlated 
with LP because LD is a less precise response predictor. 


4.2.3 Asymptotic variance analysis 

To gain a closer insight into the variance of the estimated ACE, by adjusting for the 
true propensity variable PV (if known) and the estimated propensity variable EPV, 
we consider a toy example in which the parameters in (fT9l l. (l20l) and (l2T1i are set as 
follows: 

p = 2, P0(7’ = l) = 0e(O,l), b = {bub2y, 

the covariance matrix E is diagonal with identical entries T, and 


E(A2|7’ = 1)=E(A2|7’ = 0)=E(A2) 


(26) 
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By the setting of £, we see that 


^1^2 I T. (27) 

It is also clear that the true PV is just Xi, by minimal treatment-sufficient reduc¬ 
tion and related equations (|2^ . (l24l) . (l25T l. The conditions according to our model 
setting are expressed by a DAG as Fig. |4] 



Fig. 4 Propensity variable Xi and response predictor X = {Xi,X 2 ) 


In practice, all the parameters are unknown, and consequently the exact form of 
PV is not known. What one would normally do is adjust for the whole set of the 
observed X, which is equivalent to adjusting for LD* (or EPV) in the linear regres¬ 
sion approach by Corollary[T] In particular, two linear regressions are considered as 
follows: 

Mq: Y on (T,X), 

Mi'.Y on(T,Xi). 

Then the design matrix is (l,r,Xi,X 2 )' for Mq and (1,7’,Xi)' forMi. Let Pmq 
and pMi, respectively, be the least square estimators of the parameters in Mq and 
Ml. The asymptotic variance of Pmq for sample size n is then given as: 

_ A-'Var(T I r,X) _ 

n n ’ 


E{Xi) E{X2) \ 

E(TXi) E{TX2) 

E{Xi^) E(XiX2) • 

E{XiX2) E{X2^) 

By solving and extract the (2, 2)th element which is variance multiplier of 
the coefficient of T, we have that 




where 


A = 


/ 1 0 

0 0 

E{Xi)E{TXi) 
\E{X2) E{TX2) 


Var.asy(5Mo) 


(WxiX,Wx2X2-Wx,X2^)(^ 

n9(l — 0)(VViX, VX 2 X 2 - ’ 
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where 


VxiX2 = 


= E(XiX 2 ) - E(Xi)E(X 2 ) = Cov(Xi ,^ 2 ), 

E(XiX 2) - 9E(Xi I T = 1)E(X2 | T = 1) - (1 - 0)E(Xi | T = 0)E(X2 | T = 0), 
Wx,x, = E(Xi2) - [E(Xi)] 2 = Var(Xi), 

WX2X2 = E{X2^) - [E(X 2)]2 = Var(X2), 


Vx.x, = E{Xi^) - 0[E(Xi I T = 1)]2 - (1 - 0)[E(Xi I T = 0)]^, 
= E{X2^) - e[E{X2 I T = 1)]2 - (1 - 0)[E(X2 I T = 0)]2. 


By (EUl, 


EX 2 X 2 = Var(X2) = V16f2X2 


and 

ExiX 2 = Cov(Xi,X2) = WxiX2, 


where, by ( |27] ). 

Cov(Xi,X 2 ) =E{Cov(Xi I T,X 2 I 7’)}+Cov{E(Xi | 7’),E(X2 | E)} =0. 


Hence, 


V,, =_ 0Var(Xi)/[n0(l-0)] _ 

■asy{ Mo) e(Xi2)-0[E(Xi | E = 1)]2 - (1 - 0)[E(Xi | E = 0)]2' 


(28) 


Eor M \, by (|27] |. 


Var aiy(5M|) 




«0(i — 0)yxiXi 


•Var(y|E,Xi) 

•{0+fc2^Var(X2|E,Xi)} 


n0(l — 0)VxiXi 

_ (0+fc2^T)Var(Xi)/[w0(l-0)] 

E(Xi2)-0[E(Xi |E = 1)]2-(1-0)[E(Xi |E=0)] 


2 ■ 


(29) 


Comparing ( |28] ) and ( |29] |, we have that Var asy(5Mo) < Var.asy(5Mi) unless X 2 
is random noise rather than the linear predictor i.e., b 2=0 which equalises the two 
asymptotic variances. 


Lemma 3. Under the given distributional assumptions t l79l ), ( 1201 ) and ( 1271 ). suppose 
the propensity variable LD is not the same as the linear predictor LP, and ED 
is independent of variables that are merely response predictors. Then the asymp¬ 
totic variance of the estimated ACE from the linear regression by adjusting for the 
estimated propensity variable LD* is more precise than that by adjusting for the 
population propensity variable LD. 
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4.2.4 Simulations 

Simulations are carried out for numerical illustration. Suppose we have the follow¬ 
ing true values for the paramters in ( fT^ . (l 20 l) and ( 1211 1 : p = 2 ,d = 0,8 = Q.5,b = 

(O,iy,0 = 1,0 =0.5,Ml = (1,0)',Mo = (0,0)', Z =/ 2 . 

Then the population linear predictor is LP = X 2 , with 
Y\{X,T,Ft)-^^{^T+X2,1), 

while the population linear discriminant LD = Xi which is not predictive to Y. Since 
for any regime / = 0 , 1 , 0 , 


Ef{Y \XuT) = Ef{Ef{Y \X,T)\XuT}=^-T 


and 

Var/(y |Xi,7’) = E/{Var/(F |X,7’) |Xi,7’}+Var{E/(F |X,7’) |Xi,7’} = 2. 
The conditional distribution of Y given (2fi ,T), for any regime, is then given by 

y I (Xi,7’,Er)^^(ir,2). 

To investigate the performance of the population-based as well as sample-based 
LP and PV, we now consider four linear regression models: 

Mq:Y on T and X {X = (Xj ,^ 2 )), 

M\:Y on r andXi, 

M 2 : Y on T and X 2 , 

M 3 : y on r and LD*, 

where Mq is the full model with all parameters unknown. In M\ , by setting 82 = 0, 
the true linear discriminant LD = Xi is fitted. While fitting the true linear predictor 
LP = X 2 , equivalent to setting b\ = 0, we get M 2 . Note that all these models are 
“true”. Lor Ml the true value of bi is 0, and the true residual variance is 2, as against 
1 for Mq and M 2 . Linally, for any dataset with no information of parameters, we 
construct the estimated propensity variable LD*, and then fit the model M 3 . 

In each model Mj^, for k = 0,1,2,3, the least-squares estimator 8 k is unbiased for 
8 — 0.5. By the Gauss-Markov theorem and Corollary[T] 

Var(^) = Var( 53 ) > Var(^). 

Asymptotically, we have that Var.asy(^) = Var.aiy( 53 ) = 5/n, Var.aiy($ 2 ) = 4/n, 
and Var.asy(5i) = 10/«. It is indeed asymptotically less precise to adjust for PV 
than for its estimate in our model, which is in accordance with Lemma[3] 

Lor the sample analysis, 200 simulated datasets are generated, each of size 
n = 20. Shown in Pig. |5] are the empirical distributions of 8 k for all four models. 
Unsurprisingly, in terms of precision (from high to low), first comes the LP; next 
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is the estimated propensity variable LD* (or the estimated linear predictor LP*), or 
equivalently, X (= (Xi ,^ 2 )); and last comes the true propensity variable LD = Xi. 


Linear regression (homoscedasticity) [200 datasets] 


MO: Yon{T, X={X1,X2)’) 


Ml: Yon (T, LD=X1) 



M3: Y on {T, LD*) M2: Y on (T, LP=X2) 




Fig. 5 Estimates of ACE by regression on (clockwise): 1. Xi and X 2 . 2. population linear dis¬ 
criminant (propensity variable) Xi. 3. population linear predictor X 2 . 4. estimated linear dis¬ 

criminant (propensity vaiiable) LD* . 


4.3 Normal linear model (heteroscedasticity) 

Investigation in the homoscedasticity case is simple because PV is equivalent to 
LD, where linearity makes analysis straightforward. If covariance matrices of the 
conditional distribution of X for the two treatment groups are not identical, it turns 
out that adjusting for PV is not appropriate. 

Suppose now that, keeping all other distributional assumptions of S I4.2I un¬ 
changed, (I 2 TI) is re-specified as: 


X\{T,FT=(d)^XK{^T,ET) 
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with different covariance matrices £o and £i for T = 0 and T = 1. The distribution 
of X in all regimes then becomes 

X|TV-(1-0)^(Mo,^o) + 0^(Mi,A). 


Accordingly, 

log A = c + QD 

where 

c = + -Mo^o^Vo} 

and 

QD := (£-^fii-£^^MoyX-^X'(£^^-£o^)X. (30) 

QD is the quadratic discriminant including a linear term and a quadratic term of 
X, distinguishing the observational distributions of X given T = 0,1. We see that 
QD is a minimal treatment-sufficient covariate, and thus a PV but no longer a linear 
function of X. 

Because of the balancing property of PS (or PV), it now follows that ACE = 
E(SCEqd), with 

SCEqd = Ei(y I QD) -Eo(y I QD). 

Since QD is quadratic in X, Y is no longer linear in QD, the coefficient of T by ad¬ 
justing for PV (= QD) in the linear regression does not provide exact ACE. However, 
as computation of the expectations in the above formula is non-trivial, one might 
wish to replace EejT | TjQD) by the linear regression of T on (TjQD), and approx¬ 
imate the estimated ACE. Alternatively, one can take non-parametric approaches 
such as matching or subclassification on QD ll2^ . A number of papers on various 
matching approaches for causal effects have been collected in liTTll . More recently, 
statistical software becomes available for multivariate and PS matching in R ll^ . 

Now we discuss subclassifications and linear regressions based on QD, compared 
to linear regressions based on LP and ED. The linear discriminant is again in the 
form 

LD = (mi-Mo)'^-‘V, 

but with Z = (1 — 0)£() -(- 9£i, the sum of the weighted dispersion matrices of the 
two treatment groups. Eorm the formulae of QD and ED, we conclude that it is ED 
that comprises all variables with expectations depending on T. In a DAG represen¬ 
tation of this scenario, each of such variables must have an arrow pointing to T. 
However, the genuine PV (= QD) may depend on all the components of X, accord¬ 
ing to its quadratic term in (l30l) . Only with homoscedasiticity, PV is equivalent to 
ED and includes all variables associated with T. 

Although ED is not a sufficient covariate here, Theorem[3]still applies. It enables 
us to identify ACE from the linear regression of Y on (TjLD), which is equivalent 
to the linear regression of Y on {T,X). However, other authors claim that only if 
ED is highly correlated with PS, adjustment for ED works well in regressions ll22ll . 
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This may attribute to different scenarios considered, i.e., in our model Y is linearly 
related to X while non-linear in X in theirs. 


4.3.1 Simulations 

Simulated data is based on the above model, with the parameters: p = 20, d = 0, 
5 = 0.5, 0 = 0.5, b = (0,1,...,0)', iiQ = (0,...,0)', and = (0.5,0,0,...,0)'. 
Also, Xq is set, diagonally, to 0.8 for the first ten entries and to 1.3 for the remaining 
entries, and Xi the identity matrix. 

We then have, for the population, that 



and LP = X 2 . By estimating po and pi, £q and Xi from observed data, we can 
compute sample-based LD* and QD*. 

The results from 200 simulated datasets, each of size 500, are given in Fig. |6] 
The first three plots (clockwise) are from the linear regressions of Y on, respectively 
{T,X 2 ), (7’,LD), and (7’,QD). The last plot is the result of subclassification on PV ( 
= QD). That is, 500 observations are divided into 5 subclasses with equal number of 
observations in each, based on the values of QD. Within each subclass, units from 
the two treatment groups are roughly comparable such that the average difference 
of the response may be interpreted as the estimated SCE. Then ACE is estimated by 
summing over SCEs, each weighted by 1 /5. Note that the sample size has increased, 
since we must have at least one observation for each treatment in each subclass. 

Since LD and QD are practically unknown, they need be estimated from the 
observed data. Also, we do not know exactly the response predictors or the con- 
founders, full set of the observed X may have to be used for analysis. 

Eig. |7] gives the results from the same 200 datasets as above. Again, the first 
three plots are the results of linear regressions of Y, but on, respectively, {T^X), 
(r,LD*), and (r,QD*), where LD* and QD* are the sample linear and quadratic 
discriminants. Shown in the last plot is the result of subclassification on EPV ( = 
QD*). Unsurprisingly, by comparing the mean, standard deviation and mean squared 
error of the estimated ACE, regression of Y on (7’,LP = X 2 ) comes the best among 
all eight approaches in Eig. |6] and Eig. |7] Regressing on (T, X) is no better than 
regressing on (T, X 2 ) because all variables except X 2 in X are not predictors but 
noise of Y. In confirmation of the theory in S IT. 2. II regressing on LD*, rather than 
on X, has absolutely no effect on the estimated ACE. LD* outperforms LD because 
the latter does not contain the response predictor. Regressions on LD, QD, and on 
QD* are roughly equal, because apart fromX], the distributions of the remaining 19 
variables are identical, with rather small multipliers. Thus, the two quadratic terms 
in QD are roughly the same, and QD « jXi works approximately as a function of 
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Linear regression and subciassification (heteroscedasticity) [200 datasets] 

Regression on LP=X2 Regression on LD=5/9X1 



Subclassification on QD 


Regression on QD 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 


Fig. 6 Estimates of ACE by 4 different methods (clockwise): 1. Regression on population linear 
predictor LP = X 2 . 2. Regression on population linear discriminant LD = |Xi. 3. Regression 

on population quadratic discriminant (propensity variable) QD. 4. Subciassification on QD. 


a single variable Xi. Last comes subclassification on the quadratic PV, particularly 
when it is estimated. 


4.4 Propensity analysis in logistic regression 

As already investigated, propensity analysis in linear regression is fairly straightfor¬ 
ward. In many cases, however, response Y is not linear in X. We know that despite 
its name, generalised linear model (GLM) is not a linear model, because it is a 
non-linear function of the response that is linearly related to its predictors. Logistic 
regression is widely applied as a type of GLM if the response is binary. For exam¬ 
ple, doctors often record the outcome of a surgery on a patient as either “cured” or 
“not cured”. Next, a logistic model is used in our illustrative study. 
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Linear regression and subciassification (heteroscedasticity, sampie) [200 datasets] 
Regression on X Regression on LD* 



Fig. 7 Estimates of ACE by 4 different methods (clockwise): 1. Regression on sufficient covari¬ 
ate X. 2. Regression on sample linear discriminant LD*. 3. Regression on sample quadratic 
discriminant (propensity variable) QD*. 4. Subclassification on QD*. 


4.4.1 Model construction 

For simplicity, suppose that Y,T{1 x 1) and X (p x 1) are all binary and components 
of X are mutually independent. The joint distribution of {Ft,X,T,Y) is constructed 
as follows: 


X\FT^Ber{n) (31) 

logit{P 0 (r |X)} = c + fl'X (32) 

logit{Pf{Y \T,X)} = d+5T + b'X, (33) 

for / = 0,1,0; and tt is (p x 1). Property 0 and = 1 | 7’,A') S (0,1) are required 
such that (l3^ and (l3^ are well-defined. 

It is immediately seen that A" is a strongly sufficient covariate and 


ACE = E0{Ei(y I X)} -E0{Eo(y | X)} 

= E 0 {P 0 (F I 7’= 1,X)}-E0{P0(F I 7’ = 0,X)} 
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1 +e-(rf+5+fo'x) 



If the parameters are set as follows: 

/7 = 3, TT = (7ri,7r2,7r3)', a = (ai,a2,0)', b = (0,b2,hy, (34) 

then the response predictor is b 2 X 2 +I 73 X 3 and PV = aiXi + 122 ^ 2 • The conditional 
independence properties in our model can be read off Fig. [ 8 ] Then we have that 

Fig. 8 DAG for the logistic 
model 



logit{P(F I 7’,X2,^3)} = logit{P(T I T,X)} = d + 5T + b 2 X 2 + b 2 X 2 , 


and 


P{Y = 1\T,X^,X2)=P{Y = 1\T,X2) 



I T,X2} 
1 - 


1 g — (d+ST+b2X2+bp) 2 


\ (^/+57’+Zj2-^2) 


which does not depend on X\. And we have that 



(35) 


which is determined by d,5,b2,b2,7t2 and 71 : 3 . This extremely simple example, with 
only three components of X that are all binary, already results in a complicated form 
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for ACE, which would be even worse for high dimensional X and various types of 
variables. Next, instead of simulation, we conduct propensity analysis on real data. 


4.4.2 Propensity analysis of custodial sanctions study 

We illustrate the method with the aid of a study involving 511 subjects sentenced 
to prison in 1980 by the California Superior Court, and 511 offenders sentenced 
to probation following conviction for certain felonies m. These probationers were 
matched to the prisoners on county of conviction, condition offence type and risk 
of imprisonment quantitative index, so as to bring into the final sample the most 
serious offenders on probation and the least serious offenders sentenced to prison. 
The structure of this study corresponds to the (partially matched) case-control de¬ 
sign. In fact, this is analogous to the regression discontinuity designs where only 
observations near the cut-off of the risk score are included for causal effect analysis 
ini. We were to compare the average causal effect of judicial sanction (probation 
or prison) on the probability of re-offence. We specify variables as follows. 

• Treatment T: taking values 0 (probation) and 1 (prison); 

• Response Y: occurrence of recidivism (re-offence); 

• Pre-treatment variable X; including 17 carefully selected non-collinear variables 
that we can reasonably assume to make X a strongly sufficient covariate. 

Simple random multiple imputation by bootstrapping (R package: mi) was ap¬ 
plied to deal with missing data. We then considered two logistic regressions for the 
imputed data: 

1. y on {T,X), where X includes all the 17 variables. 

2. Y on {T, EPS), where EPS is the propensity score estimated from the logistic 
regression of T on all the 17 variables. In selecting these variables, we took ad¬ 
vantage of the possibility of trying various sets of covariates in the model, with¬ 
out inflating the type I error since these regressions do not involve the response 
information. The distribution densities of the two treatment groups are shown in 
Pig.m where we see a large overlapping area. 

Shown in Tab.[T]are the results. In this case, regression on the full set of X and on 
the estimated PS makes little difference, since the summary statistics from the two 
approaches are quite similar. Although the negative values of both the coefficients 
imply reduced re-offence for the imprisonment, they are not statistically significant. 


Table 1 Coefficients of judicial sanction (“prison” with respect to “probation”) from logistic re¬ 
gressions: 1. y on (r,X); 2. Y on {T, EPS) 


Regression 

Coefficient Standard Error p-value 

Y on (T,X) 

Y on (T, EPS) 

-0.1631 

-0.1713 

0.1579 0.3014 
0.1503 0.2545 
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0.0- 


2.0- 


0.5- 


1.5- 



0.25 


0.50 

Estimated propensity score 


0.75 


1.00 


Fig. 9 Distribution density comparison of the estimated propensity score: prison vs. probation. 

5 Double robustness 

Since the underlying response regression model (RRM): Y \ {X, T,Ft = 0) and the 
propensity model (PM): T \ {X,Ft = 9) are most likely unknown, one may specify 
parametric models based on previous experience. Moreover, as discussed in § [331 
a strongly sufficient covariate can be reduced by two alternative approaches from 
specified models, which enables estimating ACE by either method as follows: 

1. Adjustment for response predictors from correctly specified RRM; 

2. Adjustment for a PV (or PS) from correctly specified PM, either in response 
regression (if RRM is correctly specified), or otherwise, by non-parametric ap¬ 
proaches, e.g., matching. 

Due to lack of knowledge, it may well be that at least one model is misspecified. 
Little could be done if both models are wrong. Thus, our interest is to find a single 
estimator that produces a good estimate, given that at least one model is correct. 

ACE is normally estimated from the observed data. Suppose there are n individ¬ 
uals in an observational study. Observations {xi,ti,yj), where i = l,..,n, are gener¬ 
ated from the joint distribution {Xj,Ti,Yi) that are independent and identically dis¬ 
tributed. The estimation of the ACE requires estimates of the expected response for 
both treatment groups assigned by intervention. We have already demonstrated that, 
within the DT framework, ACE is identifiable from pure observational data if is a 
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strongly sufficient covariate. Here, X is again assumed to be strongly sufficient and 
thus satisfies Properties[T]|2]and[3] 


5.1 Augmented inverse probability weighted estimator 

To construct the augmented inverse probability weighted (AIPW) estimator, we dis¬ 
cuss two scenarios; 

• Correct RRM: Suppose that we know the RRM. For convenience, we write 
E,(y) as ji,, so 

=E0[E0(y |X,7’ = f)] (36) 

since X is strongly sufficient. Hence, in observational studies, E0(y \X,T = t) 
is an unbiased estimator of /f,, for f = 0,1. Consequently, E0(y \ X,T =1) — 
E0(y I X, r = 0) is an unbiased estimator of ACE. 

• Correct PM; Consider that the PM is correct, i.e., 7t(X) = Pi/i{T = 1 | X). 

Lemma 4. Suppose that the propensity model is correct and that X is a strongly 
sufficient covariate. Then 

= O’) 

where E0{^y} = ^i and E0{y^^y} = ^o- 
Proof. 


= E0{E0(y|X,7’=l)} = ^i by®. 

It automatically follows that Eoj-p^^y} = ^o- By Lemma|4] we see that, under 
the observational regime, and are unbiased estimators of pi and po 

respectively. 

One may have noticed that the two terms for ACE in dJTl i are similar with the 
Horvitz-Thompson (HT) estimator for sample surveys El. They are, however, dif¬ 
ferent in various aspects. The aim of HT estimator is to estimate the mean of a finite 
population Yi , ...An. denoted hy p = YJiLi Yi, from a stratified sample of size 
n drawn without replacement. Eor / = 1,...,A, let A, be binary sampling indicator 
(Ai = 1; unit i is in sample; 0; unit i is not in sample), and Tli be the probability that 
unit i being drawn in the sample. Then HT estimator is given by; 



(38) 











26 


Hui Guo, Philip Dawid and Giovanni Berzuini 


where tt, is pre-specified, and thus known in a sample survey design. But the propen¬ 
sity model 7r(X) in dJTl i is normally unknown. Moreover, HT estimator is applied 
to estimate the mean of a finite population, while ACE is used to estimate the mean 
of a superpopulation 0. HT estimator depends on pre-specified sampling scheme, 
but observations involved in ACE are generated from, and thus are dependent on, 
the joint distribution of (X, T,Y) in the observational regime. Nevertheless, both HT 
estimator and ACE are formed by means of the inverse probability weights l/nt 
or \/n{X). In fact, HT estimator is also termed as the inverse probability weighted 
(IPW) estimator. 

Sample surveys are closely related to missing data because the information is 
missing for those not sampled. So IPW estimator is frequently used in missing data 
models in the presence of partially observed response in El El. As counterfac- 
tuals are also regarded as missing data, IPW estimator can be used in the poten¬ 
tial response framework with half observed information, to make causal inference 
of treatment effect under the assumptions of “strongly ignorable treatment assign¬ 
ment”: (y(0),T(l))_LL7’ \ X and “no unobservedconfounders” ifTIE^. 

5.1.1 Augmented inverse probability weighted estimator 

Prom above discussion, there exists an unbiased estimator of ACE if either RRM 
or PM is correct. However, unknown RRM and PM makes it impossible to decide 
whether they are correct. Nevertheless, the augmented inverse probability weighted 
(AlPW) estimator can be constructed by combining the two models in the following 
alternative forms: 



(39) 


and similarly. 



(40) 


where m{-) and n{-) are arbitrary functions of X. As also indicated in its name, 
Ut AiPW '^he sum of the IPW estimator and an augmented term. 

Lemma 5. Suppose that X is a strongly sufficient covariate. The estimator aipw 
has the property of double robustness. That is, aipw unbiased estimator of 


^ In causal system, finite number of individuals in a study is called “population”, which can be 
regardes as a sample from a larger ’’superpopulation” of interest 
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the population mean given T = t by intervention, if either k[X) = piit{T = 1 \ X) or 
mlX)=E^lY\XJ = t). 

Proof. By similarity, we only give proof of Pi aipw- Consider the following two 
scenarios; 

Scenario 1: 7t{X) = pqi{T = 1 | X) and m{X) is an arbitrary function of X. 

It is easily seen that /Ii aipw unbiased, from the proof of Lemma|4] Since 
conditional on X, the last term in ( |39] | vanishes when we take expectation of fi ^ aipw 
in the observational regime. 

Scenario 2: m(X) = £ 0(7 \ X,T = 1) and k{X) is an arbitrary function of X. 
By (|39] |, we have that 


E(Mi,aipw) = + 


= E{E[m{X)\X]}+E{E[^^{Y-m{X))\X]} 



E[m{X)]+E{ 
E[m{X)] = Pi 


n{X) 


by®. 


Indeed, if either ;r(X) = p0(7’ = 1 | X) or m(X) = £@(7 | X, T = 1), not necessarily 
both. Pi Aipw unbiased. Consequently, 


AC£aipw — Mi.aipw ~ Mo.aipw- 


Theorem 4 Suppose that X is a strongly sufficient covariate. Then the AIPW esti¬ 
mator AC£aipw a doubly robust. 

To prove TheoremlH we simply apply the fact that both /ij aipw fio aipw 
doubly robust, so is their difference. 


5.2 Parametric models 

Suppose that we specify two parametric working models: the propensity working 
model n{X\a) and the response regression working model m{T,X\^). Then by 
(I39I 1 and (l40l i. we have, for the estimated £1 (£) and £o(£), that 



and 
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^ ”1 — 7’ 1 — r 

Mo,aipw = ^ ■ l-n{x‘-a) 


]m{Q,Xr,P)} (42) 


respectively. Therefore, by (|4TI) and (02]), we have that 

1-7)- 


ACEa/pw = fL\^Ipw — flOAIPW 

Ti 




f^{-n{Xi\a) \-K{Xi\a) 


]{Yi-m{Ti,Xi-M, (43) 


which is doubly robust, i.e., ACEa//>w is a consistent and asymptotically normal 
estimator of ACE if either of the working models is correctly specified. 


5.2.1 Discussion 

Kang and Schafer IH state that there are various ways to construct an estimator 
which is doubly robust. In our view, they are essentially the same, i.e., it must be in 
the same (or similar) form of AIPW estimator which is constructed by combining 
RRM and PM. Other constructions proposed in M are just variations of AIPW 
estimator. Eor example, in dSSl l. instead of using N as denominator for each unit, 
they use normalised weights Such normalised weights are especially useful 

for precision improvement in the case that subjects with very small probabilities of 
being sampled are actually drawn from the population. Because if N is used as the 
weight, these subjects will influence the estimated average response enormously, 
and consequently, result in poor precision. 

Kang and Schafer IH have also investigated the precision performance of an 
doubly robust estimator when both 7t{X) and m{X) are moderately misspecified. 
They state that “in at least some settings, two wrong models are not better than one”. 
This seems obvious because the performance of this estimator will depend on the 
degree of misspecification of both models. This can be easily analysed in theory but 
far more complicated in practice, as one can not have a good control of specifying 
models 7t{X) and m{X) based on limited observed data and previous experience (if 
any). Therefore, it would be difficult to measure to what extent the specified models 
are different from the true ones. 


5.3 Precision of ACEaipw 

5.3.1 Known propensity score model 

We already see that ACEaipw is an unbiased and doubly robust estimator of 
ACE. Then how can we choose an arbitrary function m{Xi) to minimise the vari¬ 
ance of ACEaipw given correct PM? Suppose that in an experiment, we know 
n{Xi) = P(7] = 1 I Xi). Then in terms of the variance, we have that 
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Var(ACE^w) = Var{«-1[£( 


= « ^{Var[j 2 ( 


\-Ti 


K{Xi) l-7Z{Xi) 


l-Ti 


\ 7Z{Xi) l-7Z(Xi) 


){Yi-m{Xi)]} 


)i^'] 


VarE( 


T, 


-2Cov[^( 


l-Tj 

^^^n{Xi) l-n{Xi) 

T, 


)m{Xi)] 


IT’. n 7 ’. 

' )Yi,U: 


l-Ti 


n{Xi) l-n{Xi) n{Xi) l-7z{Xi) 


)m{Xi)]} 


= «-2|y^r(ACEHr)+E[£ 


-iXi) 


- 2 EE 


t\^iXt){l-7t{Xi)) 
m{Xi)nii m(Xi){iiu-lii). 


f,7tiXi)il-7tiXi)) il-7tiXt)r 


= «-2{Var(ACEHr)+E[^ 


-iXi) 


- 2 E{ 


Mb' 


\ 7t(Xi){l - 7t(Xi)) 

Mb - Mi 


t,^7z{X,){l-n{Xi)) (l-7r(X,))2 


}miXi)]}, 


where Mb = E 0 (E | Xj, Ti=\) and Mi = E 0 (E | Xi). 

By minimising the quadratic function of m{Xi) in the expectation, it follows that 


m{Xi) = [l-n{Xi)]iiu + n{Xi)iM)i 

= [1 - K{Xi)]E^(Yi I X,-, 7i- = 1) + K{Xi)E^{Yi I Xi, Ti = 0), (44) 


which minimises the variance of PXIEaipw among all functions of Xi. In fact, if 
either n{Xi) = pq,{Ti = 1 | Xi), or (l44li holds, ACEaipw is unbiased, and thus is 
doubly robust. 

Let m\{Xi) and mo{Xi) denote the regressions of Y on Xi for the two treat¬ 
ment groups in the observational regime. It is unnecessary to require that mi (Xi) = 
E 0 (E \Xi,Ti = 1) and that mo(Xi) = E 0 (y,' | X,-, 7) = 0). As long as m(Xi) is specified 
as the sum of the weighted expectations as in the form of (l44l i. m{Xi) minimises the 
variance of the estimated ACE. 

Same result is obtained in ||2^ as (l44l i. by minimising a weighted mean squared 
error of m{Xi). We now discuss an alternative approach provided in 1^ . Let f, 
denote a weighted response in a form as follows; 


?/=[{ 



l}7^ + { 


1 

1 - n{Xi) 




Then by (l44l i. it follows that 


m{Xi) 


1 - Tt{Xi) 

n{Xi) 


E^[Yi\XiJi 


1)P(7’ = 1 |X) 


( 45 ) 





















30 


Hui Guo, Philip Dawid and Giovanni Berzuini 


niXi) 


-E0(F, |^i,7^ = 0)P(7’=0|X) 


1 - n{Xt) 

IX,) + y^e.((i - T,)y | x| 
’'** (i-XMxlx) 


= E,nl^y 


7t{Xi) 1 - 7t(Xi) 

= E,(y, lx), 


where m{Xi) is obtained by simply regressing Yj on Xi, rather than regressing F, 
on both Xi and Ti. However, an obvious disadvantage of this approach is its low 
precision. When individuals with the PS close to 0 are actually in the treatment 
group and/or those with the PS close to 1 are actually assigned to the control group, 
the weights 1 / n{Xi) or 1/(1 — 7t{Xi)) of these units will be very large, which leads 
to corresponding responses being highly influential, which is dangerous. In fact, it 
may be even worse than the HT estimator as we will see next. 

To show the difference of these approaches, we have implemented Monte Carlo 
computations for four estimators of AC^aipw'- 

1. by (l44l) with E 0 (T \Xi,Ti = \) and E 0 (E | Xi,Ti = 0) estimated by regressing F, 
on {Xi,Ti). 

2. by (l44l) with E 0 (F | Xt^ = 1) and E 0 (F | Xi, Ti = 0) estimated by regressing F,- 
on Xi for the treatment group and control group separately. 

3. by Horvitz-Thompson approach, i.e. without covariate adjustment. 

4. by regression of F, on Xi as in (l46l l. 

The results of simulated 100 datasets are shown in Fig. [TO] The first two ap¬ 
proaches give similar results. That is, we can estimate E 0 (F, | Xi,Ti = 1) and 
E 0 (F, I Xi,Ti = 0 ) either simultaneously from the response regression on the treat¬ 
ment and X, or separately from the response regression only on X for each treatment 
group. As expected, the last approach generates several extreme estimates relative 
to others, which makes its variance even much larger than that of the HT estimator. 


5.3.2 Known response regression model 

Suppose that E 0 (F, | 7] = 1) and E 0 (T | Tj- = 0) are both known but not the 

PM. Then the AIPW estimator can be constructed as: 


where 
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Estimated ACE (100 datasets) 



Fig. 10 Precision of the estimated ACE based on: (1) specified model for E 0 (i^' | X;, 7;); (2) speci¬ 
fied models for ^{Yi \ X,) separately for both groups; (3) Horvitz-Thompson estimator; (4) regres¬ 
sion of Yi on X;. 


m{Xi) = (1 I XiJi = l)+g{Xi)EiYi \ Xiji = 0), 

and g{Xi) is an arbitrary function of X,. 

So ACEaipw is unbiased and its variance is computed as follows. 

Var(ACE^w) = Var{n-‘ ^ " m{Xi)]} 

= „-2Var{f - [1 -^(X,))Mi, +^(^,)Mo/]} 

g{Xi) ^-g{xo 

= n-^YaT{f^{nu - Moi) + 

= n-^VaT{f^{fiu-lioi)} 


+ n-'E{ Var[f ^ (E- -flu)- I > 


> n ^Var{^(^i/-^o,)} = Var(ACEffffM)- 
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Hence, we conclude that, for each individual, if the conditional expectations of the 
response given Xi for both groups are known or correctly specified, then ACE^/pw 
will be less precise than the estimated ACE from the response regressions. 


5.3.3 Discussion 

If the PM is known, then the variance of ACEaipw is minimised when m{Xi) is 
specified as in (l44l) - where separate specification of mi{Xi) and mo(X,) are not 
necessary. Rubin and van de Laan ll2^ has introduced a weighted response serving 
as an alternative, but we have shown, by simulations, that it could result in large 
variance of the estimated ACE and possibly larger than the HT estimator. In the case 
that the RRM is correctly specified, i.e., mi{Xi) = E 0 (y,' | Xi,Tj = 1) and mo(X,) = 
E 0 (y, I X, , Ti = 0), then these two models rather than the AIPW estimator should be 
used to estimate ACE for higher precision of the estimator. 


6 Summary 

In this chapter, we have addressed statistical causal inference using Dawid’s decision- 
theoretic framework within which assumptions are, in principle, testable. Through¬ 
out, the concept of sufficient covariate plays a crucial role. We have investigated 
propensity analysis in a simple normal linear model, as well as in logistic model, 
theoretically and by simulation. Adding weight to previous evidence Eol EU [n 
[331 im, our results show that propensity analysis does little in improving estimation 
of the treatment causal effect, either unbiasedness or precision. However, as part of 
the augmented inverse probability weighted estimator that is doubly robust, correct 
propensity score model helps provide unbiased average causal effect. 


Appendix 

R code of simulations and data analysis 


################################################################ 

Figure 5: Linear regression (homoscedasticity) 


1. Y on X; 

2. Y on population linear discriminant / propensity variable LD; 

3. Y on sample linear discriminant / propensity variable LD*; 

4 . Y on population linear predictor LP. 

################################################################ 
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## set parameters 

p <- 2 

delta <- 0.5 
phi <- 1 
n <- 20 

alpha <- matrix(c(1,0), nrow=l) 
sigma <- diag(l, nrow=p) 
b <- matrix(c(0,1), nrow=p) 

## create a function to compute ACE from four linear regressions 
ps <- function (r) { 

# data for T, X and Y from the specified linear normal model 

set.seed(r) 

.Random.seed 
t <- rbinom{n, 1, 0.5) 

require(MASS) 
m <- rep{0, p) 

ex <- mvrnorm{n, mu=m, Sigma=sigma) 

X <- t%*%alpha + ex 

ey <- rnorm{n, mean=0, sd=sqrt(phi)) 
y <- t*delta + x%*%b + ey 

# calculate the true and sample linear discriminants 

id.true <- x%*%solve(sigma)%*%t(alpha) 
pred <- x%*%b 

dl <- data.frame(x, t) 
c <- coef(Ida(t~.,dl)) 
id <- x%*%c 

# extract estimated average causal effect (ACE) 

# from the four linear regressions 

dhat.pred <- coef(summary(Im(y~1+t+pred)))[2] 
dhat.x <- coef(summary(Im(y~t+x)))[2] 
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dhat.Id <- coef(summary(Im(y~t+ld)))[2] 
dhat.ld.true <- coef(summary(Im(y~t+ld.true)))[2] 

return(c(dhat.X, dhat.Id, dhat.ld.true, dhat.pred)) 


## estimate ACE from 200 simulated datasets 

## compute mean, standard deviation and mean square error of ACE 

g <- rep(0, 4) 
for (r in 31:230) { 
g <- rbind(g, ps(r)) 

} 

g <- g[-1,] 

d.mean <- 0 
d.sd <- 0 
mse <- 0 

for (i in 1:4) { 

d.mean[i] <- round(mean(g[,i]),4) 
d.sd[i] <- round(sd(g[,i]),4) 

mse[i] <- round((d.sd[i])"2+(d.mean[i]-delta)"2, 4) 

} 


## generate Figure 5 

par(mfcol=c(2,2), oma=c(1.5,0,1.5,0), las=l) 

main=c("M0: Y on (T, X=(X1, X2)')", "M3: Y on (T, LD*)", 

"Ml: Y on (T, LD=X1)", "M2: Y on (T, LP=X2)") 

for (i in 1:4){ 

hist(g[,i], br=seq(-2.5, 2.5, 0.5), xlim=c(-2.5, 2.5), ylim=c(0,80), 
main=main[i], col.lab="blue", xlab="", ylab="",col="magenta") 
legend (-2.5, 85, c (paste ( "mean = ", d.mean [i] ) , pasteC'sd = ",d.sd[i]), 
paste("mse = ",mse[i])), cex=0.85, bty="n") 

} 

mtext(side=3, cex=1.2, line=-l.l, outer=T, col="blue", 

text="Linear regression (homoscedasticity) [200 datasets]") 

dev.copy(postscript,"Irpvpdecmbook.ps", horiz=TRUE, paper="a4") 
dev.off() 
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########################################################################### 

Linear regression and subclassification (heteroscedasticity) 


Figure 6: 

1. Regression on population linear predictor LP; 

2. Regression on population linear discriminant LD; 

3. Regression on population quadratic discriminant / propensity variable QD 

4. Subclassification on QD. 


Figure 7: 

1. Regression on sample linear predictor LP*; 

2. Regression on sample linear discriminant LD*; 

3. Regression on sample quadratic discriminant / propensity variable QD*; 

4. Subclassification on QD*. 

########################################################################### 


## set parameters 

p <- 20 

d <- 0 

delta <- 0.5 
phi <- 1 
n <- 500 


a <- matrix(rep(0,p), nrow=l) 

alpha <- matrix(c(0.5,rep(0,p-1)), nrow=l) 

sigmal <- diag(l, nrow=p) 

sigmaO <- diag(c(rep(0.8, 10), rep (1.3, 10)), nrow=p) 
b <- matrix(c(0, 1, rep{0,p-2)), nrow=p) 


## create a function to compute ACE from eight approaches 
ps <- function (r) { 

# data for T, X and Y from the specified linear normal model 

set.seed(r) 

.Random.seed 
pi <- 0.5 

t <- rbinom{n, 1, pi) 
nO <- 0 
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for (i in 1:n) { 

if (t[i]==0) 
nO <- nO+1 

} 

t <- sort (t, decreasing=FALSE) 
mul <- a+alpha 
muO <- a 

require(MASS) 
m <- rep ( 0, p) 

exO <- mvrnorm(nO, mu=m, Sigma=sigmaO) 
exl <- mvrnorm( (n-nO), mu=m, Sigma=sigmal) 

a <- matrix(rep(a, n), nrow=n, byrow=TRUE) 
xO <- a[(l:nO),] + t[1:nO]%*%alpha + exO 
xl <- a[(nO+l):n,] + t[(nO+l):n]%*%alpha + exl 
X <- rbind(xO, xl) 

ey <- rnorm(n, mean=0, sd=sqrt(phi)) 
d <- rep (d, n) 

y <- d + t*delta + x%*%b + ey 

# calculate linear discrimant, quadratic discrimant, for population 

# and for sample, extract estimated ACE from linear regressions 

id <- x%*%solve(pi*sigmal+pi*sigmaO)%*%t(alpha) 
dl <- data.frame(x, t) 
c <- coef(Ida(t~.,dl)) 

Id.s <- x%*%c 

zl <- x%*%(solve(sigmal)%*%t(mul) - solve(sigmaO)%*%t(muO)) 
z2 <- 0 

for (j in 1:n){ 

z2[j] <- - 1/2*matrix(x[j,], nrow=l)%*%(solve(sigmal) 

- solve(sigmaO))%*%t(matrix(x[j,], nrow=l)) 

} 

qd <- zl+z2 

dhat.x2 <- coef(summary(Im(y~1+t+x[,2])))[2] 
dhat.ld <- coef(summary(Im(y~1+t+ld)))[2] 
dhat.qd <- coef(summary(Im(y~1+t+qd)))[2] 

mn <- aggregate(dl, list(t=t), FUN=mean) 
mO <- as.matrix(mn[1, 2:(p+l)]) 
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ml <- as.matrix(mn[2, 2:(p+l)]) 
vO <- var(xO) 
vl <- var(xl) 

cl <- solve(vl)%*%t(ml)-solve(vO)%*%t(mO) 

zl. s <- x%*%cl 

c2 <- solve(vl)-solve(vO) 

z2.s <- 0 

for (1 in 1:n){ 

z2.s[i] <- -l/2*matrix(x[1,], nrow=l)%*%c2%*%t(matrix(x[1, 

} 

qd.s <- zl.s + z2 . s 

dhat.X <- coef(summary(Im(y~1+t+x)))[2] 
dhat.ld.s <- coef(summary(Im(y~1+t+ld.s)))[2] 
dhat.qd.s <- coef(summary(Im(y~1+t+qd.s)))[2] 

# extract estimated ACE from subclassification 

d2 <- data.frame(cbind(qd, qd.s, y, t)) 

tml <- vector{"list", 2) 
tmO <- vector{"list", 2) 
te.qd <- 0 

for (k in 1:2) { 

d3 <- d2[, c (k,3,4)] 

d3 <- split{d3[order{d3[,1]), ], rep(l:5, each=100)) 

tm <- vector{"list", 5) 
for {j in 1:5) { 

tm[[j]] <- aggregate{d3[[j]], list(Stratum=d3[[j]]$t), 
tml[[k]][j] <- tm[[j]][2,3] 
tmO[[k]][j] <- tm[[j]][1,3] 

} 

te.qd[k] <- sum(tml[[k]] - tm0[[k]])/5 

} 


# return estimated ACE from the eight approaches 

return(c(dhat.x2, te.qd[l], dhat.Id, dhat.qd, 
dhat.x, te.qd[2], dhat.ld.s, dhat.qd.s)) 

} 


], nrow=l)) 


FUN=mean) 
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## estimate ACE from 200 simulated datasets 

## compute mean, standard deviation and mean square error of ACE 

g <- rep(0, 8) 
for (r in 31:230) { 

g <- rbind{g, ps{r)) 

} 

g <- g[-1,] 

d.mean <- 0 
d.sd <- 0 
d.mse <- 0 

for (i in 1:8) { 

d.mean[i] <- round(mean(g[,i]),4) 
d.sd[i] <- round(sd(g[,i]),4) 

d.mse[i] <- round((d.sd[i])^2+(d.mean[i]-delta)^2, 4) 

} 


## generate Figure 6 

par(mfcol=c{2,2), oma=c{1.5,0,1.5,0), las=l) 

main=c{"Regression on LP=X2Subclassification on QD", 

"Regression on LD=5/9X1","Regression on QD") 
for (i in 1:4){ 

hist(g[,i], br=seq(-0.1, 1.1, 0.1), xlim=c(-0.1, 1.1), ylim=c(0,80), 
main=main[i], col.lab="blue", xlab="", , ylab="", col="magenta") 
legend (-0.2,85, c (paste { "mean = ", d. mean [ i ] ) , pasteC'sd = ",d.sd[i]) 
paste("mse = ",d.mse[i])), cex=0.85, bty="n") 

} 

mtext(side=3, cex=1.2, line=-l.l, outer=T, col="blue", 
text="Linear regression and subclassification 
(heteroscedasticity) [200 datasets]") 

dev.copy(postscript,"pslrsubtruebook.ps", horiz=TRUE, paper="a4") 
dev.off 0 


## generate Figure 7 

main=c("Regression on X","Subclassification on QD*", 

"Regression on LD*", "Regression on QD*") 
for (i in 1:4){ 

hist(g[,i+4], br=seq(-0.1, 1.1, 0.1), xlim=c(-0.1,1.1), ylim=c(0,80) 
main=main[i], col.lab="blue", xlab="", ylab="", col="magenta") 
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legend (-0.2,85, c (paste { "mean = ", d. mean [ i + 4 ] ) , pasteC'sd = " , d. sd [ i + 4 ] ) , 
paste("mse = ",d.mse[i+4])), cex=0.85, bty="n") 

} 

mtext{side=3, cex=1.2, line=-l.l, outer=T, col="blue", 
text="Linear regression and subclassification 
(heteroscedasticity, sample) [200 datasets]") 

dev.copy(postscript,"pslrsubbook.ps", horiz=TRUE, paper="a4") 
dev.off 0 


###################################################################### 

Figure 9 and Table 1: Propensity analysis of custodial sanctions study 


1 . Y on X; 

2. Y on population linear discriminant / propensity variable LD; 

3. Y on sample linear discriminant / propensity variable LD*; 

4 . Y on population linear predictor LP. 

###################################################################### 

## read data, imputation by bootstrapping for missing data 

dAll = read. CSV (file="pre_impute_data. CSV" , as.is=T, sep=',', header=T) 

set.seed(100) 

.Random.seed 
library(mi) 

data.imp <- random.imp(dAll) 


## estimate propensity score by logistic regression 

glm.ps<-glm(Sentenced_to_prison~ 

Age_at_lst_yuvenile_incarceration_y + 
N_prior_adult_convictions + 

Type_of_defense_counsel + 

Guilty_plea_with_negotiated_disposition + 
N_jail_sentences_gr_90days + 
N_juvenile_incarcerations + 
Monthly_income_level + 

Total_counts_convicted_for_current_sentence + 
Conviction_offense_type + 

Recent_release_from_incarceration_m + 
N_prior_adult_StateFederal_prison_terms + 
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Offender_race + 

Offender_released_during_proceed + 

Separated_or_divorced_at_t iine_of_sentence + 

Li ving_situation_at_tiine_of_of fence + 

Status_at_tiine_of_of fense + 

Any_victiins_feinale, 

data = data.imp, familY=binomial) 


summary(glm.ps) 

eps <- predict(glm.ps, data = data.imp[, -1], type='response') 
d.eps <- data.frame(data.imp, Est.ps = eps) 

## Figure 9: densities of estimated propensity score (prison vs. probation) 
library(ggplot2) 

d.plot <- data.frame(Prison = as.factor(data.imp$Sentenced_to_prison), 
Est.ps = eps) 
pdf("ps.dens.book.pdf") 

ggplot(d.plot, aes(x=Est.ps, fill=Prison)) + geom_density(alpha=0.25) + 
scale_x_continuous(name="Estimated propensity score") + 
scale_y_continuous(name="Density") 
dev.off() 


## logistic regression of the outcome on all 17 variables 

glm.y.allx<-glm(Recidivism" 

Sentenced_to_prison + 

Age_at_lst_yuvenile_incarceration_y + 
N_prior_adult_convictions + 

Type_of_defense_counsel + 

Guilty_plea_with_negotiated_disposition + 
N_jail_sentences_gr_90days + 
N_juvenile_incarcerations + 
Monthly_income_level + 

Total_counts_convicted_for_current_sentence + 
Conviction_offense_type + 

Recent_release_from_incarceration_m + 
N_prior_adult_StateFederal_prison_terms + 

Offender_race + 

Offender_released_during_proceed + 
Separated_or_divorced_at_time_of_sentence + 

Living_situation_at_time_of_offence + 


Sufficient Covai'iate, Propensity Variable and Doubly Robust Estimation 


41 


Status_at_tiine_of_of fense + 

Any_victiins_feinale, 

data = d.eps, family=binomial) 

summary(glm.y.allx) 


## logistic regression of the outcome on the estimated propensity score 

glm.y.eps<-glm(Recidivism ~ Sentenced_to_prison + Est.ps, 
data = d.eps, family=binomial) 
summary{glm.y.eps) 
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